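Vectors are the most basic data type in R, much like the ratio data, interval data, and so on used in statistics. Different data types suit different use cases. The chunk below introduces six kinds of vector.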
is.integer(1L) # 0, -1, 1, -2 ,2... - integer
## [1] TRUE
is.numeric(1.358) # number with digits - numeric
## [1] TRUE
is.complex(1 + 7i) # number with i - complex
## [1] TRUE
is.character("marketing") # string - characters
## [1] TRUE
is.logical(TRUE) # TRUE or FALSE - logical (boolean)
## [1] TRUE
is.factor(factor(c("Marketing", "Accounting"))) # some labels - factor
## [1] TRUE
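As a small sketch of how these types interact: when c() is given values of different types, it coerces them to the most general one, and class() reports the result. The variable names below are illustrative and do not come from the original tutorial.

```r
# Illustrative values only; none of these names appear in the tutorial.
mixed_num <- c(1L, 2.5)    # integer + double  -> numeric
mixed_chr <- c(1L, "a")    # integer + string  -> character
mixed_lgl <- c(TRUE, 2L)   # logical + integer -> integer

class(mixed_num)
class(mixed_chr)
class(mixed_lgl)
```

This coercion is the reason a single vector cannot hold, say, a number and a string at the same time without changing one of them.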
R provides a complete set of numeric comparison operators, written with the same symbols as in everyday mathematics.
5 > 3 # greater than
## [1] TRUE
5 < 7 # less than
## [1] TRUE
7 == 8 # equal to
## [1] FALSE
7 != 8 # not equal to
## [1] TRUE
5 >= 5 # greater than or equal to
## [1] TRUE
8 <= 10 # less than or equal to
## [1] TRUE
!(7 > 8) # inverse logical expression
## [1] TRUE
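Beyond numeric comparisons, combining them with logic such as "and" and "or" makes conditions more flexible and closer to how they are used in practice.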
5 > 3 & 8 > 9 # and
## [1] FALSE
5 > 3 | 8 > 9 # or
## [1] TRUE
xor(TRUE, TRUE) # returns TRUE only if exactly one of the two expressions is TRUE
## [1] FALSE
xor(TRUE, FALSE)
## [1] TRUE
5 %in% 5 : 10 # show if 5 belongs to the vector, 5 to 10
## [1] TRUE
4 %between% c(3, 5) # checks whether 4 lies between 3 and 5; %between% comes from the data.table package, not base R
## [1] TRUE
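In R, logical values are treated as a special kind of number and can be used in arithmetic. TRUE is stored as 1 and FALSE as 0, so adding and subtracting logicals is allowed, and in practice it is often used to count how many values meet a given condition.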
TRUE + TRUE # Equal to 1 + 1
## [1] 2
TRUE - FALSE # Equal to 1 - 0
## [1] 1
TRUE == 1 # Equivalent to asking whether 1 == 1
## [1] TRUE
FALSE == 0 # Equivalent to asking whether 0 == 0
## [1] TRUE
TW <- 1 : 5
sum(TW > 3) # Equal to FALSE + FALSE + FALSE + TRUE + TRUE or 0 + 0 + 0 + 1 + 1
## [1] 2
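The same trick extends from counts to proportions: because TRUE counts as 1, the mean of a logical vector is the share of elements that satisfy the condition. A minimal sketch, reusing the TW vector from above with hypothetical result names:

```r
TW <- 1:5
above_three <- TW > 3              # FALSE FALSE FALSE TRUE TRUE
n_above <- sum(above_three)        # how many values are greater than 3
share_above <- mean(above_three)   # what fraction of values is greater than 3

print(n_above)
print(share_above)
```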
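A list is a data structure that can hold vectors of different types; each element keeps its own type instead of being changed to match the others. A matrix is a two-dimensional data type, commonly used for vectorized scientific computing.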
list(1, 1.358, 1 + 7i, "marketing")
## [[1]]
## [1] 1
##
## [[2]]
## [1] 1.358
##
## [[3]]
## [1] 1+7i
##
## [[4]]
## [1] "marketing"
matrix(1 : 12, nrow = 4, ncol = 3) # A 4 by 3 matrix
## [,1] [,2] [,3]
## [1,] 1 5 9
## [2,] 2 6 10
## [3,] 3 7 11
## [4,] 4 8 12
matrix(1 : 4, 2, 2) + matrix(5 : 8, 2, 2) # matrix addition
## [,1] [,2]
## [1,] 6 10
## [2,] 8 12
matrix(1 : 4, 2, 2) * matrix(5 : 8, 2, 2) # element-wise multiplication
## [,1] [,2]
## [1,] 5 21
## [2,] 12 32
matrix(1 : 4, 2, 2) %*% matrix(5 : 8, 2, 2) # general matrix multiplication
## [,1] [,2]
## [1,] 23 31
## [2,] 34 46
t(matrix(1 : 4, 2, 2)) # t stands for transpose
## [,1] [,2]
## [1,] 1 2
## [2,] 3 4
diag(matrix(1 : 4, 2, 2)) # diag extracts the diagonal elements of a matrix, or builds a diagonal matrix from given elements
## [1] 1 4
diag(3) # diag(n) creates an n by n identity matrix
## [,1] [,2] [,3]
## [1,] 1 0 0
## [2,] 0 1 0
## [3,] 0 0 1
det(matrix(1 : 4, 2, 2)) # obtaining the determinant
## [1] -2
solve(matrix(1 : 4, 2, 2)) # this allows you to obtain inverse matrix
## [,1] [,2]
## [1,] -2 1.5
## [2,] 1 -0.5
matrix(1 : 4, 2, 2) %*% solve(matrix(1 : 4, 2, 2)) # checking whether the inverse matrix is correct
## [,1] [,2]
## [1,] 1 0
## [2,] 0 1
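A common reason to want an inverse is to solve a linear system. As a sketch, solve() also accepts a right-hand side and then solves A x = b directly, without forming the inverse first. The vector b below is a hypothetical example, not part of the original.

```r
A <- matrix(1:4, 2, 2)   # same matrix as above: columns (1, 2) and (3, 4)
b <- c(5, 6)             # hypothetical right-hand side
x <- solve(A, b)         # solves A %*% x = b

print(x)
A %*% x                  # multiplying back should reproduce b
```

For larger systems this form is generally preferred over computing solve(A) %*% b, since it does less work.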
The c function concatenates data.
c(1, 2, 3, 4, 5)
## [1] 1 2 3 4 5
c(6, 7, 8, 9, 10)
## [1] 6 7 8 9 10
c(c(1, 2, 3), c(4, 5, 6))
## [1] 1 2 3 4 5 6
c("Dr.", "Chen", "is", "brilliant")
## [1] "Dr." "Chen" "is" "brilliant"
c(1 + 8i, 0 + 3i, 4 + 99i)
## [1] 1+ 8i 0+ 3i 4+99i
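In practice we often need to run different computations on the same data. Recomputing the same value every time makes the process verbose and wastes computation, so we store a computed value or dataset in a variable and recall it more conveniently later. In R, "<-" and "->" are defined as the assignment operators. In ordinary use "=" has the same effect, but it can cause errors in certain situations, so using "=" in place of the assignment operators is not recommended.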
NCHU <- c(1, 2, 3, 4) # "<-" and "->" are the assignment operators
c(1, 2, 3, 4) -> UCCU # works the same way as the previous line
print(NCHU) # print function allows you to print out the content in the variable
## [1] 1 2 3 4
identical(NCHU, UCCU) # identical function allows you to check whether the values of two variables are identical
## [1] TRUE
Clarkson = c(1, 2, 3, 4) # "=" generally yields the same result
print(Clarkson)
## [1] 1 2 3 4
NCHU + 1 # we can apply arithmetic operations to variables directly
## [1] 2 3 4 5
Marketing <- NCHU ^ 2 # We can also assign the result to a second variable after manipulating the first
print(Marketing) # "^" denotes exponentiation; here each element is squared
## [1] 1 4 9 16
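One situation where "=" and "<-" behave differently is inside a function call: there "=" names an argument, while "<-" performs an assignment. The sketch below illustrates this with a hypothetical helper, compare_assignment, which is not from the original tutorial.

```r
# Illustrative helper; the function and variable names are hypothetical.
compare_assignment <- function() {
  m1 <- median(x = 1:10)                         # "=" only names the argument x
  after_equals <- exists("x", inherits = FALSE)  # no local variable x was created
  m2 <- median(x <- 1:10)                        # "<-" assigns x, then passes it on
  after_arrow <- exists("x", inherits = FALSE)   # a local variable x now exists
  c(after_equals = after_equals, after_arrow = after_arrow)
}

assignment_check <- compare_assignment()
print(assignment_check)
```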
Basic functions. Note: R is a case-sensitive language, so names that differ only in letter case are treated as different variables.
Oreo <- sample(1 : 10, 10, replace = FALSE) # sample function allows you to select numbers randomly
print(Oreo) # Let us check the content of variable
## [1] 3 5 4 10 6 8 9 7 2 1
length(Oreo) # length function will return the number of elements in a variable
## [1] 10
sort(Oreo) # If a variable is of numeric or integer type, sort arranges it in either increasing or decreasing order
## [1] 1 2 3 4 5 6 7 8 9 10
order(Oreo) # order returns the positions of the elements in increasing (or decreasing) order
## [1] 10 9 1 3 2 5 8 6 7 4
rev(Oreo) # rev simply reverses the vector
## [1] 1 2 7 9 8 6 10 4 5 3
kitkat <- sample(1 : 5, 6, replace = TRUE)
print(kitkat)
## [1] 4 4 2 3 3 4
unique(kitkat)
## [1] 4 2 3
duplicated(kitkat)
## [1] FALSE TRUE FALSE FALSE TRUE TRUE
Oreo[3] # We can pick out specific elements with square brackets
## [1] 4
Oreo[c(1, 3, 5, 7, 9)] # multiple selection is allowed
## [1] 3 4 6 9 2
5 > 3 # A logical statement generates a logical value
## [1] TRUE
Oreo[Oreo > 5] # Logical value could be used to select values in variables
## [1] 10 6 8 9 7
Oreo[Oreo > 3 & Oreo <= 8] # multiple filters are viable, & stands for "and"
## [1] 5 4 6 8 7
Oreo[Oreo > 5 | Oreo == 3] # | stands for "or"
## [1] 3 10 6 8 9 7
which.max(Oreo) # the which family of functions returns the positions of particular elements
## [1] 4
which.min(Oreo)
## [1] 10
which(c(TRUE, FALSE, TRUE)) # which function returns the position of elements which are TRUE
## [1] 1 3
Oreo[which.max(Oreo)] # Combining these two kinds of operation
## [1] 10
Oreo[which.min(Oreo)]
## [1] 1
An if statement is syntax that makes the computer perform a specific action after evaluating a boolean value; it is used in all kinds of code.
Newton <- 5
if (Newton > 7) {
print("good")
}
if (Newton > 3) {
print(Newton)
}
## [1] 5
if (Newton > 5) {
print("Good")
} else {
print("Bad")
}
## [1] "Bad"
Coke <- "pikachu"
if (Coke == "squirtle") {
print("squirtle")
} else if (Coke == "charmander") {
print("Charmander")
} else if (Coke == "pikachu") {
print("pikachu")
}
## [1] "pikachu"
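An if statement checks a single condition. For a whole vector of values, base R's ifelse applies the same test element by element. A small sketch, with hypothetical scores that are not from the original:

```r
scores <- c(2, 5, 8)                           # hypothetical values
labels <- ifelse(scores > 5, "Good", "Bad")    # one label per element

print(labels)
```

This mirrors the Good/Bad example above, but handles all three values in one call instead of needing a loop around if.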
Besides the if statement, the for loop and the while loop are indispensable tools in any programming language.
Init <- 1 # We first assign an initial value
for (i in 1 : 5) {
Init <- Init + 1
print(Init)
}
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
print(Init) # checking the output
## [1] 6
Init <- 1 # Reassigning the initial value
for (i in 1 : 5) {
Init <- Init + i
print(Init)
}
## [1] 2
## [1] 4
## [1] 7
## [1] 11
## [1] 16
The while loop gives users another option, but note that using a while loop incorrectly can lead to an infinite loop.
Init <- 1 # As steps above
while (Init < 10) {
Init <- Init + 1
}
print(Init) # checking final output
## [1] 10
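One simple way to guard against an accidental infinite loop is to add a step counter with an upper limit. This is only a sketch of the idea; the names max_steps and steps, and the limit of 1000, are assumptions chosen for illustration.

```r
Init <- 1
max_steps <- 1000   # hypothetical safety cap, not in the original
steps <- 0

# The loop stops when the target is reached OR the cap is hit,
# so a mistake in the update rule cannot run forever.
while (Init < 10 && steps < max_steps) {
  Init <- Init + 1
  steps <- steps + 1
}

print(Init)
print(steps)
```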
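R is a function-based programming language, and functions are what make the whole language work. Users can define frequently used computations themselves, store them as custom functions, and call them when needed.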
square <- function(x) {
return (x ^ 2)
}
square(5)
## [1] 25
is_even <- function(x) {
if (x %% 2 == 0) {
return (TRUE)
} else {
return (FALSE)
}
}
is_even(13)
## [1] FALSE
No_Arg <- function() {
print("Argument is not necessary")
}
No_Arg()
## [1] "Argument is not necessary"
Arb_Arg <- function(x, y, z, r = TRUE) {
print(sort(c(x, y, z), decreasing = r))
}
Arb_Arg(3, 99, 1) # we create a more complex function and assign a default argument
## [1] 99 3 1
Arb_Arg(3, 99, 1, FALSE) # We can override the default argument by passing a value
## [1] 1 3 99
Arb_Arg(r = FALSE, 3, 99, 1) # We can swap the position by pointing out the name of argument
## [1] 1 3 99
Detector <- function(user) {
  if (user == "teacher") {
    print("Authenticated")
  } else if (user == "student") {
    print("Unauthenticated")
  } else {
    print("Unauthenticated")
  }
}
Detector("teacher")
## [1] "Authenticated"
Detector("student")
## [1] "Unauthenticated"
Detector("Bad guy")
## [1] "Unauthenticated"
Detector("Teacher") # "Teacher" does not match "teacher" because R is case-sensitive
## [1] "Unauthenticated"
Next we use the slightly more complex Fibonacci sequence as an example of a function.
fibonacci <- function(len) {
  # Computes the Fibonacci sequence for a given sequence length.
  #
  # Args :
  #   len : Length of the Fibonacci sequence
  #
  # Returns : the first len numbers of the Fibonacci sequence
  # A checker guards against unexpected values of the argument "len":
  # a length that is zero, negative, or not a whole number makes no sense
  # in any scenario.
  if (len <= 0 || len %% 1 != 0) {
    stop("The argument len must be a positive integer")
  }
  # Preallocating the target vector reduces RAM usage and processing time.
  Initial <- vector("integer", length = len)
  Initial[1 : length(Initial)] <- 1 # The given initial values of the Fibonacci sequence
  # If the requested length is less than 3, nothing further needs to be computed,
  # so we simply return 1 or 1, 1 when the length is 1 or 2.
  # Otherwise, a for loop computes the rest of the sequence by the Fibonacci rule.
  if (len < 3) {
    return (Initial[1 : len])
  } else {
    for (i in 1 : (len - 2)) {
      Initial[i + 2] <- sum(Initial[i : (i + 1)])
    }
    return (Initial)
  }
}
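To see the input check and the recurrence at work, here is a compact sketch of the same idea. It is a condensed variant written for illustration, under the hypothetical name fib_sketch, rather than the function defined above; tryCatch is used to show what a rejected argument looks like to the caller.

```r
# Condensed illustrative variant; fib_sketch is a hypothetical name.
fib_sketch <- function(len) {
  stopifnot(len >= 1, len %% 1 == 0)   # reject non-positive or fractional lengths
  out <- rep(1, len)                   # the first two terms are both 1
  if (len >= 3) {
    for (i in 3:len) out[i] <- out[i - 1] + out[i - 2]
  }
  out
}

first_ten <- fib_sketch(10)
print(first_ten)

# An invalid length raises an error, which tryCatch turns into a message.
bad_call <- tryCatch(fib_sketch(2.5), error = function(e) "rejected")
print(bad_call)
```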
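(Optional) Next come the more advanced apply-family functions, whose members include sapply, lapply, apply, and others. These functions are more abstract than the ones above, and it takes some time to get used to their logic and concepts, so this part is supplementary material. If you are interested, come back to these functions after finishing the basic R tutorial.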
x <- c(1, 4, 9, 16, 25)
result_sapply <- sapply(x, sqrt)
result_lapply <- lapply(x, sqrt)
class(result_sapply)
## [1] "numeric"
class(result_lapply)
## [1] "list"
print(result_sapply)
## [1] 1 2 3 4 5
print(result_lapply)
## [[1]]
## [1] 1
##
## [[2]]
## [1] 2
##
## [[3]]
## [1] 3
##
## [[4]]
## [1] 4
##
## [[5]]
## [1] 5
result_sapply_notsim <- sapply(x, sqrt, simplify = FALSE)
identical(result_lapply, result_sapply_notsim)
## [1] TRUE
y <- 1 : 5
virtual_output1 <- sapply(y, function(k) {
k ^ 2
})
virtual_output2 <- lapply(y, function(p) {
(p / 5) + 1
})
print(virtual_output1)
## [1] 1 4 9 16 25
print(virtual_output2)
## [[1]]
## [1] 1.2
##
## [[2]]
## [1] 1.4
##
## [[3]]
## [1] 1.6
##
## [[4]]
## [1] 1.8
##
## [[5]]
## [1] 2
sapply(1 : 5, function(k) {
paste0("student", k)
})
## [1] "student1" "student2" "student3" "student4" "student5"
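The examples above cover sapply and lapply. As a sketch of the remaining family member mentioned earlier, apply works on the rows or columns of a matrix, and vapply is a stricter relative of sapply that states the expected result type up front. The matrix and names below are illustrative only.

```r
m <- matrix(1:6, nrow = 2)          # 2 x 3 matrix: rows (1, 3, 5) and (2, 4, 6)

row_totals <- apply(m, 1, sum)      # MARGIN = 1 -> one result per row
col_totals <- apply(m, 2, sum)      # MARGIN = 2 -> one result per column

# vapply behaves like sapply but checks that every result is a single number.
roots <- vapply(c(1, 4, 9), sqrt, numeric(1))

print(row_totals)
print(col_totals)
print(roots)
```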
Next we show how to read external data into R. Many other methods exist besides these, but they are outside the scope of this tutorial.
kie <- read.csv("iris.csv") # for comma-separated files
head(kie)
## X Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 1 5.1 3.5 1.4 0.2 setosa
## 2 2 4.9 3.0 1.4 0.2 setosa
## 3 3 4.7 3.2 1.3 0.2 setosa
## 4 4 4.6 3.1 1.5 0.2 setosa
## 5 5 5.0 3.6 1.4 0.2 setosa
## 6 6 5.4 3.9 1.7 0.4 setosa
zoo <- read_excel("Titanic.xlsx") # for Excel files; read_excel comes from the readxl package
head(zoo)
## # A tibble: 6 x 6
## X__1 Class Sex Age Survived Freq
## <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 1 1st Male Child No 0
## 2 2 2nd Male Child No 0
## 3 3 3rd Male Child No 35.0
## 4 4 Crew Male Child No 0
## 5 5 1st Female Child No 0
## 6 6 2nd Female Child No 0
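The X column in the read.csv output above looks like row names that were saved into the file as an unnamed first column. That is an assumption about how iris.csv was written, since the file itself is not shown. A minimal round-trip sketch, using a temporary file and the hypothetical name round_trip, shows how row.names = FALSE avoids the extra column:

```r
# Write the built-in iris data to a temporary CSV and read it back.
tmp <- tempfile(fileext = ".csv")            # temporary path, removed below
write.csv(iris, tmp, row.names = FALSE)      # do not store row names as a column
round_trip <- read.csv(tmp, stringsAsFactors = TRUE)
unlink(tmp)

names(round_trip)   # the original columns, with no extra X column
dim(round_trip)
```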
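Another way to read data. This advanced function (fread) comes from package(data.table) and is shown here only to demonstrate the difference that parallel reading makes.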
## test replications elapsed relative user.self sys.self
## 1 Built-in 1 39.957 106.552 39.287 0.570
## 3 Fread_With_Multi 1 0.375 1.000 1.207 0.208
## 2 Fread_With_Single 1 0.802 2.139 0.676 0.124
## user.child sys.child
## 1 0 0
## 3 0 0
## 2 0 0
The data.frame is the most important data type for storing two-dimensional data in R; it is essentially similar to a sheet in Excel.
summary(iris) # obtain the basic information of a data.frame
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
str(iris) # shows the structure of a data.frame: the type and first values of each column
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
dim(iris) # obtain the numbers of rows and columns
## [1] 150 5
nrow(iris) ; ncol(iris) # another way to obtain the numbers of rows and columns
## [1] 150
## [1] 5
head(iris) # show the first six rows
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
Basic operations of data.frame
iris[1 : 5, 1 : 3] # the number before comma refers to the row number and the number after comma refers to the column number
## Sepal.Length Sepal.Width Petal.Length
## 1 5.1 3.5 1.4
## 2 4.9 3.0 1.4
## 3 4.7 3.2 1.3
## 4 4.6 3.1 1.5
## 5 5.0 3.6 1.4
iris$Sepal.Length # $ sign allows you to extract given column value, and this is also a vector
## [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4
## [18] 5.1 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5
## [35] 4.9 5.0 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0
## [52] 6.4 6.9 5.5 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8
## [69] 6.2 5.6 5.9 6.1 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4
## [86] 6.0 6.7 6.3 5.6 5.5 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8
## [103] 7.1 6.3 6.5 7.6 4.9 7.3 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7
## [120] 6.0 6.9 5.6 7.7 6.3 6.7 7.2 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7
## [137] 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8 6.7 6.7 6.3 6.5 6.2 5.9
iris$Sepal.Length[1 : 10]
## [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9
iris[iris$Sepal.Length > 7,] # Logical values can filter a data.frame. Leaving the part before or after the comma blank selects all rows or all columns, respectively
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 103 7.1 3.0 5.9 2.1 virginica
## 106 7.6 3.0 6.6 2.1 virginica
## 108 7.3 2.9 6.3 1.8 virginica
## 110 7.2 3.6 6.1 2.5 virginica
## 118 7.7 3.8 6.7 2.2 virginica
## 119 7.7 2.6 6.9 2.3 virginica
## 123 7.7 2.8 6.7 2.0 virginica
## 126 7.2 3.2 6.0 1.8 virginica
## 130 7.2 3.0 5.8 1.6 virginica
## 131 7.4 2.8 6.1 1.9 virginica
## 132 7.9 3.8 6.4 2.0 virginica
## 136 7.7 3.0 6.1 2.3 virginica
iris$IamNewbee <- 1 # a single value (or a vector whose length divides the number of rows) is recycled to fill the whole column
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species IamNewbee
## 1 5.1 3.5 1.4 0.2 setosa 1
## 2 4.9 3.0 1.4 0.2 setosa 1
## 3 4.7 3.2 1.3 0.2 setosa 1
## 4 4.6 3.1 1.5 0.2 setosa 1
## 5 5.0 3.6 1.4 0.2 setosa 1
## 6 5.4 3.9 1.7 0.4 setosa 1
iris$IamNewbee <- c(1, 2)
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species IamNewbee
## 1 5.1 3.5 1.4 0.2 setosa 1
## 2 4.9 3.0 1.4 0.2 setosa 2
## 3 4.7 3.2 1.3 0.2 setosa 1
## 4 4.6 3.1 1.5 0.2 setosa 2
## 5 5.0 3.6 1.4 0.2 setosa 1
## 6 5.4 3.9 1.7 0.4 setosa 2
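Performing many computations in a single line makes the code complicated and hard to follow, and errors are harder to spot when debugging later. The pipeline lets R code be written in a simple, readable way. Below are a deeply nested version and a pipeline version of the same code: the two give the same result, but the pipeline version makes it much easier to see which computations are performed.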
tarvec <- rnorm(10, 0, 1)
example1 <- sd(sqrt(tarvec + 10)) # Deeply nested calls like this are what we want to avoid.
example2 <- (tarvec + 10) %>% sqrt %>% sd # The pipe %>% (from magrittr, also available through dplyr) improves readability
identical(example1, example2)
## [1] TRUE
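Since R 4.1, base R also ships a native pipe, written |>, so the same idea can be tried without loading any package. The sketch below swaps in the native pipe for %>%; the seed is an arbitrary choice that makes the run repeatable, and the names nested and piped are illustrative.

```r
set.seed(1)                                  # arbitrary seed, for reproducibility
tarvec <- rnorm(10, 0, 1)

nested <- sd(sqrt(tarvec + 10))              # inside-out reading order
piped  <- (tarvec + 10) |> sqrt() |> sd()    # left-to-right; needs R >= 4.1

identical(nested, piped)
```

Unlike %>%, the native pipe requires the parentheses after each function name, as in sqrt() and sd().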
The dplyr package provides six functions that are very commonly used in data processing: filter, select, arrange, mutate, summarise, and group_by.
iris %>% filter(Sepal.Length > 7) # filter by Sepal.Length > 7
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species IamNewbee
## 1 7.1 3.0 5.9 2.1 virginica 1
## 2 7.6 3.0 6.6 2.1 virginica 2
## 3 7.3 2.9 6.3 1.8 virginica 2
## 4 7.2 3.6 6.1 2.5 virginica 2
## 5 7.7 3.8 6.7 2.2 virginica 2
## 6 7.7 2.6 6.9 2.3 virginica 1
## 7 7.7 2.8 6.7 2.0 virginica 1
## 8 7.2 3.2 6.0 1.8 virginica 2
## 9 7.2 3.0 5.8 1.6 virginica 2
## 10 7.4 2.8 6.1 1.9 virginica 1
## 11 7.9 3.8 6.4 2.0 virginica 2
## 12 7.7 3.0 6.1 2.3 virginica 2
iris %>% filter(Sepal.Length > 7 & Sepal.Width > 3) # filter by multiple features
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species IamNewbee
## 1 7.2 3.6 6.1 2.5 virginica 2
## 2 7.7 3.8 6.7 2.2 virginica 2
## 3 7.2 3.2 6.0 1.8 virginica 2
## 4 7.9 3.8 6.4 2.0 virginica 2
iris %>% filter(Sepal.Length > 7) %>% select(1 : 4) # select allows you to pick given columns
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 7.1 3.0 5.9 2.1
## 2 7.6 3.0 6.6 2.1
## 3 7.3 2.9 6.3 1.8
## 4 7.2 3.6 6.1 2.5
## 5 7.7 3.8 6.7 2.2
## 6 7.7 2.6 6.9 2.3
## 7 7.7 2.8 6.7 2.0
## 8 7.2 3.2 6.0 1.8
## 9 7.2 3.0 5.8 1.6
## 10 7.4 2.8 6.1 1.9
## 11 7.9 3.8 6.4 2.0
## 12 7.7 3.0 6.1 2.3
iris %>% filter(Sepal.Length > 7) %>% select(-5) # a negative index drops a column: here column 5 (Species) is removed, so IamNewbee is still shown
## Sepal.Length Sepal.Width Petal.Length Petal.Width IamNewbee
## 1 7.1 3.0 5.9 2.1 1
## 2 7.6 3.0 6.6 2.1 2
## 3 7.3 2.9 6.3 1.8 2
## 4 7.2 3.6 6.1 2.5 2
## 5 7.7 3.8 6.7 2.2 2
## 6 7.7 2.6 6.9 2.3 1
## 7 7.7 2.8 6.7 2.0 1
## 8 7.2 3.2 6.0 1.8 2
## 9 7.2 3.0 5.8 1.6 2
## 10 7.4 2.8 6.1 1.9 1
## 11 7.9 3.8 6.4 2.0 2
## 12 7.7 3.0 6.1 2.3 2
iris %>% filter(Sepal.Length > 7) %>% arrange(Sepal.Length) # arrange allows you to rearrange your row order
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species IamNewbee
## 1 7.1 3.0 5.9 2.1 virginica 1
## 2 7.2 3.6 6.1 2.5 virginica 2
## 3 7.2 3.2 6.0 1.8 virginica 2
## 4 7.2 3.0 5.8 1.6 virginica 2
## 5 7.3 2.9 6.3 1.8 virginica 2
## 6 7.4 2.8 6.1 1.9 virginica 1
## 7 7.6 3.0 6.6 2.1 virginica 2
## 8 7.7 3.8 6.7 2.2 virginica 2
## 9 7.7 2.6 6.9 2.3 virginica 1
## 10 7.7 2.8 6.7 2.0 virginica 1
## 11 7.7 3.0 6.1 2.3 virginica 2
## 12 7.9 3.8 6.4 2.0 virginica 2
iris %>% filter(Sepal.Length > 7) %>% arrange(-Sepal.Length) # descending order instead of ascending
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species IamNewbee
## 1 7.9 3.8 6.4 2.0 virginica 2
## 2 7.7 3.8 6.7 2.2 virginica 2
## 3 7.7 2.6 6.9 2.3 virginica 1
## 4 7.7 2.8 6.7 2.0 virginica 1
## 5 7.7 3.0 6.1 2.3 virginica 2
## 6 7.6 3.0 6.6 2.1 virginica 2
## 7 7.4 2.8 6.1 1.9 virginica 1
## 8 7.3 2.9 6.3 1.8 virginica 2
## 9 7.2 3.6 6.1 2.5 virginica 2
## 10 7.2 3.2 6.0 1.8 virginica 2
## 11 7.2 3.0 5.8 1.6 virginica 2
## 12 7.1 3.0 5.9 2.1 virginica 1
iris %>% filter(Sepal.Length > 7) %>% arrange(Sepal.Length, Sepal.Width) # arrange by two features
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species IamNewbee
## 1 7.1 3.0 5.9 2.1 virginica 1
## 2 7.2 3.0 5.8 1.6 virginica 2
## 3 7.2 3.2 6.0 1.8 virginica 2
## 4 7.2 3.6 6.1 2.5 virginica 2
## 5 7.3 2.9 6.3 1.8 virginica 2
## 6 7.4 2.8 6.1 1.9 virginica 1
## 7 7.6 3.0 6.6 2.1 virginica 2
## 8 7.7 2.6 6.9 2.3 virginica 1
## 9 7.7 2.8 6.7 2.0 virginica 1
## 10 7.7 3.0 6.1 2.3 virginica 2
## 11 7.7 3.8 6.7 2.2 virginica 2
## 12 7.9 3.8 6.4 2.0 virginica 2
iris %>% filter(Sepal.Length > 7) %>% select(-5) %>% mutate(bonus.length = Sepal.Length + 1) # mutate allows you to generate new features
## Sepal.Length Sepal.Width Petal.Length Petal.Width IamNewbee
## 1 7.1 3.0 5.9 2.1 1
## 2 7.6 3.0 6.6 2.1 2
## 3 7.3 2.9 6.3 1.8 2
## 4 7.2 3.6 6.1 2.5 2
## 5 7.7 3.8 6.7 2.2 2
## 6 7.7 2.6 6.9 2.3 1
## 7 7.7 2.8 6.7 2.0 1
## 8 7.2 3.2 6.0 1.8 2
## 9 7.2 3.0 5.8 1.6 2
## 10 7.4 2.8 6.1 1.9 1
## 11 7.9 3.8 6.4 2.0 2
## 12 7.7 3.0 6.1 2.3 2
## bonus.length
## 1 8.1
## 2 8.6
## 3 8.3
## 4 8.2
## 5 8.7
## 6 8.7
## 7 8.7
## 8 8.2
## 9 8.2
## 10 8.4
## 11 8.9
## 12 8.7
iris %>% summarise(zoo = mean(Sepal.Length), hospital = median(Sepal.Width),
spring = n(), applepie = sd(Petal.Width))
## zoo hospital spring applepie
## 1 5.843333 3 150 0.7622377
iris %>% group_by(Species) %>% summarise(morgan = mean(Sepal.Length), nick = median(Sepal.Width), semi = sd(Petal.Width), bob = n())
## # A tibble: 3 x 5
## Species morgan nick semi bob
## <fct> <dbl> <dbl> <dbl> <int>
## 1 setosa 5.01 3.40 0.105 50
## 2 versicolor 5.94 2.80 0.198 50
## 3 virginica 6.59 3.00 0.275 50
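For comparison, a grouped summary like the one above can also be sketched in base R, without dplyr. This uses aggregate and table on the built-in iris data; the result names are hypothetical.

```r
# Mean Sepal.Length per species, using a formula: value ~ grouping variable.
by_species <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
print(by_species)

# Number of rows per species, the base R counterpart of n().
species_counts <- table(iris$Species)
print(species_counts)
```

The dplyr version scales more comfortably to several summaries at once, which is why the tutorial uses it.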
Student's t-test
zoe <- sample(1 : 50, 100, replace = TRUE)
t.test(zoe, mu = 25) # the t-test assumes a normally distributed population, though it is robust to violations of that assumption
##
## One Sample t-test
##
## data: zoe
## t = -0.19185, df = 99, p-value = 0.8483
## alternative hypothesis: true mean is not equal to 25
## 95 percent confidence interval:
## 21.93755 27.52245
## sample estimates:
## mean of x
## 24.73
lucy <- sample(1 : 50, 100, replace = TRUE)
t.test(zoe, lucy, paired = TRUE)
##
## Paired t-test
##
## data: zoe and lucy
## t = -1.038, df = 99, p-value = 0.3018
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -6.085111 1.905111
## sample estimates:
## mean of the differences
## -2.09
aa <- rnorm(20, 5, 1) # generate normally distributed sequence with mean = 5
# and standard deviation = 1
bb <- rnorm(20, 5, 1)
t.test(aa, bb, var.equal = TRUE)
##
## Two Sample t-test
##
## data: aa and bb
## t = 0.23766, df = 38, p-value = 0.8134
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.5747974 0.7277093
## sample estimates:
## mean of x mean of y
## 5.150241 5.073785
cc <- rnorm(20, 15 ,1)
dd <- rnorm(20, 5, 5)
var.test(cc, dd) # the variance test is sensitive to violations of the assumption that the population is normally distributed
##
## F test to compare two variances
##
## data: cc and dd
## F = 0.068237, num df = 19, denom df = 19, p-value = 2.558e-07
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.02700888 0.17239643
## sample estimates:
## ratio of variances
## 0.0682366
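To make the one-sample t-test less of a black box, the statistic can be recomputed by hand as (sample mean - hypothesised mean) / (sample sd / sqrt(n)) and compared with what t.test reports. The sample, seed, and names below are assumptions made for this sketch.

```r
set.seed(42)                        # arbitrary seed so the run is repeatable
x <- rnorm(30, mean = 5, sd = 1)    # hypothetical sample
mu0 <- 5                            # hypothesised population mean

# t statistic and two-sided p-value computed from the definitions.
t_manual <- (mean(x) - mu0) / (sd(x) / sqrt(length(x)))
p_manual <- 2 * pt(-abs(t_manual), df = length(x) - 1)

fit <- t.test(x, mu = mu0)
c(manual = t_manual, builtin = unname(fit$statistic))
c(manual = p_manual, builtin = fit$p.value)
```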
One-way ANOVA
value <- lapply(c(10, 30, 35 ,70), function(k) {
rnorm(10, k, sd = sample(1 : 10, 1))
}) %>% do.call(c, .)
data_ANOVA <- data.frame(value = value, group = factor(rep(1 : 4, each = 10)))
head(data_ANOVA, 20)
## value group
## 1 3.859601 1
## 2 7.969302 1
## 3 12.238604 1
## 4 16.371370 1
## 5 11.004331 1
## 6 9.462165 1
## 7 9.014391 1
## 8 8.343835 1
## 9 21.829830 1
## 10 15.913142 1
## 11 33.726421 2
## 12 38.959332 2
## 13 20.476531 2
## 14 35.797314 2
## 15 22.809110 2
## 16 15.729133 2
## 17 49.029159 2
## 18 30.265285 2
## 19 28.103523 2
## 20 49.366947 2
aov_result <- aov(value ~ group, data = data_ANOVA)
summary(aov_result)
## Df Sum Sq Mean Sq F value Pr(>F)
## group 3 17557 5852 144.8 <2e-16 ***
## Residuals 36 1455 40
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(aov_result)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = value ~ group, data = data_ANOVA)
##
## $group
## diff lwr upr p adj
## 2-1 20.825618 13.168904 28.48233 0.0000001
## 3-1 24.879696 17.222982 32.53641 0.0000000
## 4-1 58.429662 50.772948 66.08638 0.0000000
## 3-2 4.054077 -3.602637 11.71079 0.4918490
## 4-2 37.604044 29.947330 45.26076 0.0000000
## 4-3 33.549966 25.893252 41.20668 0.0000000
Two-way ANOVA
value <- sample(1000, 60) # sample 60 values from range 1 to 1000
two_way_data <- data.frame(
carbrand = rep(c("Toyota", "BMW", "Nissan", "Audi"), each = 15),
region = rep(c("US", "EU", "JP", "TW", "KR"), times = 12),
miles = value
)
two_way_result <- aov(miles ~ carbrand * region, data = two_way_data)
summary(two_way_result)
## Df Sum Sq Mean Sq F value Pr(>F)
## carbrand 3 283326 94442 1.550 0.21644
## region 4 143785 35946 0.590 0.67173
## carbrand:region 12 2588977 215748 3.542 0.00126 **
## Residuals 40 2436621 60916
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
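Simple linear regression - R offers a very concise way to fit linear models: a single line completes the fit, and the result is then ready to be evaluated.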
kie <- read.csv("Advertising.csv")
Linear_Regression <- lm(Sales ~ TV, data = kie)
summary(Linear_Regression)
##
## Call:
## lm(formula = Sales ~ TV, data = kie)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.3860 -1.9545 -0.1913 2.0671 7.2124
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.032594 0.457843 15.36 <2e-16 ***
## TV 0.047537 0.002691 17.67 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.259 on 198 degrees of freedom
## Multiple R-squared: 0.6119, Adjusted R-squared: 0.6099
## F-statistic: 312.1 on 1 and 198 DF, p-value: < 2.2e-16
predict(Linear_Regression, newdata = data.frame(TV = seq(20, 200, 20)))
## 1 2 3 4 5 6 7
## 7.983326 8.934059 9.884792 10.835525 11.786258 12.736990 13.687723
## 8 9 10
## 14.638456 15.589189 16.539922
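For a single predictor, the least-squares estimates have a closed form: the slope is cov(x, y) / var(x) and the intercept is mean(y) - slope * mean(x). As a sketch, this can be checked against lm. Because Advertising.csv is not available here, the example uses the cars dataset that ships with R; that substitution is an assumption made for illustration.

```r
fit <- lm(dist ~ speed, data = cars)    # cars is a built-in dataset

# Closed-form least-squares estimates for one predictor.
slope_manual <- cov(cars$speed, cars$dist) / var(cars$speed)
intercept_manual <- mean(cars$dist) - slope_manual * mean(cars$speed)

coef(fit)
c(intercept_manual, slope_manual)
```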
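Logistic regression is a classifier that resembles simple linear regression; passing the linear predictor through the non-linear sigmoid function turns it into a simple yet highly interpretable classifier.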
CO2$Type <- factor(CO2$Type, levels = c("Mississippi", "Quebec"))
CO2$Type <- as.numeric(CO2$Type) - 1
Logi <- glm(Type ~ uptake, data = CO2, family = binomial(link = "logit"))
summary(Logi)
##
## Call:
## glm(formula = Type ~ uptake, family = binomial(link = "logit"),
## data = CO2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.68565 -0.74112 0.02006 0.66966 2.29454
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.87192 0.87273 -4.437 9.14e-06 ***
## uptake 0.14130 0.02992 4.723 2.32e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 116.449 on 83 degrees of freedom
## Residual deviance: 83.673 on 82 degrees of freedom
## AIC: 87.673
##
## Number of Fisher Scoring iterations: 4
answer <- predict(Logi, newdata = data.frame(uptake = seq(1, 50, 2)), type = "response") # type belongs to predict, not to data.frame; "response" returns probabilities
answer > 0.5
## 1 2 3 4 5 6 7 8 9 10 11 12
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 13 14 15 16 17 18 19 20 21 22 23 24
## FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## 25
## TRUE
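The sigmoid step can be sketched directly. By default, predict on a glm returns values on the link (log-odds) scale, and type = "response" converts them to probabilities; the conversion is the sigmoid 1 / (1 + exp(-z)), available in base R as plogis. The log-odds values below are hypothetical.

```r
log_odds <- c(-2, 0, 0.3, 2)                 # hypothetical linear predictors

prob_manual <- 1 / (1 + exp(-log_odds))      # sigmoid written out
prob_builtin <- plogis(log_odds)             # the same transformation, built in

round(prob_manual, 3)
prob_manual > 0.5                            # same answer as log_odds > 0
```

A probability threshold of 0.5 therefore corresponds to a log-odds threshold of 0, not 0.5, which is why the scale of the predictions matters when classifying.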
fitting plot
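Regularization adds model complexity to the objective function, which keeps the model as simple as possible and helps avoid overfitting. We introduce two regressions, ridge and lasso, each of which adds a different regularization term.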
regularization term
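Here we introduce the regression models that use these two regularization terms, ridge regression and lasso regression. Both are included in package(glmnet); adjusting its settings further is possible, but that lies outside this tutorial, and interested students can look up resources on their own.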
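In their usual textbook form, the two regularization terms can be written as follows. This is a standard formulation supplied for reference, not a formula recovered from the original; the symbols (n observations, p predictors, penalty weight λ) are assumptions of this sketch.

```latex
% Ridge regression: squared (L2) penalty on the coefficients
\hat{\beta}^{\text{ridge}}
  = \arg\min_{\beta_0,\,\beta}\;
    \sum_{i=1}^{n} \bigl(y_i - \beta_0 - x_i^{\top}\beta\bigr)^2
    + \lambda \sum_{j=1}^{p} \beta_j^{2}

% Lasso regression: absolute-value (L1) penalty on the coefficients
\hat{\beta}^{\text{lasso}}
  = \arg\min_{\beta_0,\,\beta}\;
    \sum_{i=1}^{n} \bigl(y_i - \beta_0 - x_i^{\top}\beta\bigr)^2
    + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
```

In glmnet the two are selected with alpha = 0 (ridge) and alpha = 1 (lasso), as in the code that follows. glmnet scales the terms somewhat differently (for example, it divides the squared-error term by 2n), so its lambda values are not directly comparable with this textbook form. The L1 penalty can set coefficients exactly to zero, while the L2 penalty only shrinks them.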
kie <- read.csv("Advertising.csv")
kie$TVSQ <- kie$TV ^ 2
kie$RadioSQ <- kie$Radio ^ 2
kie$NewspaperSQ <- kie$Newspaper ^ 2
Simple_Model <- lm(Sales ~ ., data = kie)
Ridge_Model <- glmnet(kie %>% select(-Sales) %>% as.matrix, kie$Sales, alpha = 0, lambda = 10 ^ -1)
Lasso_Model <- glmnet(kie %>% select(-Sales) %>% as.matrix, kie$Sales, alpha = 1, lambda = 10 ^ -1)
cbind(Simple_Model %>% coef, Ridge_Model %>% coef, Lasso_Model %>% coef) %>% as.matrix %>% data.frame %>%
`colnames<-`(c("Simple", "Ridge", "Lasso"))
## Simple Ridge Lasso
## (Intercept) 1.4111232889 2.757062e+00 3.4295760715
## TV 0.0780903298 5.666145e-02 0.0446572119
## Radio 0.1595367310 1.349648e-01 0.1545814521
## Newspaper 0.0101178222 1.233238e-02 0.0000000000
## TVSQ -0.0001123336 -4.156505e-05 0.0000000000
## RadioSQ 0.0007009898 1.113635e-03 0.0005656946
## NewspaperSQ -0.0001225688 -1.609764e-04 0.0000000000
survey <- MASS::survey
test_table <- table(survey$Smoke, survey$Exer)
test_table
##
## Freq None Some
## Heavy 7 1 3
## Never 87 18 84
## Occas 12 3 4
## Regul 9 1 7
chisq.test(test_table)
## Warning in chisq.test(test_table): Chi-squared approximation may be
## incorrect
##
## Pearson's Chi-squared test
##
## data: test_table
## X-squared = 5.4885, df = 6, p-value = 0.4828
advan_table <- cbind(test_table[, 1], test_table[, 2] + test_table[, 3])
dimnames(advan_table)[[2]] <- c("Freq", "seldom")
advan_table
## Freq seldom
## Heavy 7 4
## Never 87 102
## Occas 12 7
## Regul 9 8
chisq.test(advan_table)
##
## Pearson's Chi-squared test
##
## data: advan_table
## X-squared = 3.2328, df = 3, p-value = 0.3571
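The warning "Chi-squared approximation may be incorrect" typically appears when some expected cell counts are small (a common rule of thumb is below 5). As a sketch, the expected counts can be inspected directly. The table below re-enters the Smoke by Exer counts printed above by hand, so that the example does not depend on the MASS package.

```r
# Counts copied from the Smoke x Exer table above, entered column by column.
smoke_exer <- matrix(c(7, 87, 12, 9,     # Freq
                       1, 18, 3, 1,      # None
                       3, 84, 4, 7),     # Some
                     nrow = 4,
                     dimnames = list(c("Heavy", "Never", "Occas", "Regul"),
                                     c("Freq", "None", "Some")))

res <- suppressWarnings(chisq.test(smoke_exer))
round(res$expected, 2)       # expected counts under independence
sum(res$expected < 5)        # how many cells fall below 5
```

Merging sparse categories, as done with advan_table above, is one way to raise the expected counts.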
head(mtcars, 10)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
target_table <- table(mtcars$carb, mtcars$cyl)
chisq.test(target_table)
## Warning in chisq.test(target_table): Chi-squared approximation may be
## incorrect
##
## Pearson's Chi-squared test
##
## data: target_table
## X-squared = 24.389, df = 10, p-value = 0.006632
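kmeans is an unsupervised learning algorithm: without being given any external information, it tries to find a suitable grouping among the features. Because of how the algorithm works, running kmeans about 100 times to look for the global optimum is a reasonable choice; if the number of iterations is too small, a very unreasonable grouping may result.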
clustering <- kmeans(x = iris[, -5], centers = 3, iter.max = 300)
table(iris$Species, clustering$cluster)
##
## 1 2 3
## setosa 0 0 50
## versicolor 47 3 0
## virginica 14 36 0
fviz_cluster(clustering,
data = iris[, -5],
geom = c("point", "text"),
ellipse.type = "norm")
K_test <- sapply(2 : 15, function(k) {
kmeans(x = iris[, -5], centers = k, iter.max = 300)$tot.withinss
})
plot(x = 2 : 15, y = K_test, type = "b", xlab = "number of clusters", ylab = "within groups squared error")
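Note that in stats::kmeans, iter.max limits the iterations within one run, while the number of random restarts is controlled by a separate argument, nstart; the best of the restarts is kept. A sketch of repeated runs using nstart follows. The seed and the value 25 are arbitrary choices, and only the four numeric iris columns are used.

```r
set.seed(123)                        # arbitrary seed for repeatability
features <- iris[, 1:4]              # the four numeric measurement columns

# 25 random starts; kmeans keeps the solution with the lowest tot.withinss.
fit <- kmeans(features, centers = 3, nstart = 25, iter.max = 300)

fit$tot.withinss                     # total within-cluster sum of squares
table(iris$Species, fit$cluster)
```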
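PCA (Principal Component Analysis) is one of the most widely used dimensionality-reduction algorithms. It is lossy: the goal is to reduce high-dimensional data to fewer dimensions with an acceptable loss of information. A common practice is to keep the components whose eigenvalues are greater than 1, or to draw a scree plot and pick the number of dimensions at which the information loss is acceptable for the research or application. Most scientific uses of PCA are to remove highly redundant dimensions, to support visualization, or to speed up algorithms.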
cancer[, 31] <- ifelse(cancer$Cancer.Type == 1, "benign", "malignant")
PCA <- prcomp(cancer[, -31], scale. = TRUE, center = TRUE) # Implementing PCA on the data except the target column
Info <- data.frame(summary(PCA)$importance) %>% transpose()
Here we reduce the data to two dimensions, because two dimensions are easier to visualize.
PCA_transformed <- data.frame(PCA$x[, 1 : 2])
PCA_transformed$Cancer.Type <- cancer[, 31]
ggplot(data = PCA_transformed, aes(PC1, PC2, color = Cancer.Type)) + geom_point()
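The "eigenvalue greater than 1" rule and the information-loss idea can be sketched with prcomp's own output: the squared sdev values are the eigenvalues, and their shares of the total give the proportion of variance each component retains. Since the cancer data is not available here, this sketch uses the built-in iris measurements as a stand-in, which is an assumption made for illustration.

```r
pca_iris <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

eigenvalues <- pca_iris$sdev ^ 2              # variance carried by each component
prop_var <- eigenvalues / sum(eigenvalues)    # share of the total variance
cum_var <- cumsum(prop_var)                   # variance retained by the first k components
keep_kaiser <- sum(eigenvalues > 1)           # components passing the eigenvalue > 1 rule

round(rbind(eigenvalues, prop_var, cum_var), 3)
print(keep_kaiser)
```

Plotting eigenvalues against the component index gives the scree plot mentioned above.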
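FA (factor analysis) is a dimensionality-reduction algorithm often mentioned alongside PCA. Although both can reduce the dimensionality of data, their assumptions and principles differ. PCA assumes that each PC is a linear combination of the existing variables, whereas FA assumes that a linear combination of a set of latent variables plus an error term forms the observed variables. In research, factor analysis is often used to look for latent abstract variables, such as hygiene or a particular psychological state, so that researchers can do better causal analysis.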
FA_result <- fa(cancer[, -31], nfactors = 2, rotate = "varimax", fm = "ml")
FA_result$loadings
##
## Loadings:
## ML1 ML2
## mean.radius 0.996
## mean.texture 0.334
## mean.perimeter 0.998
## mean.area 0.988
## mean.smoothness 0.216 0.642
## mean.compactness 0.561 0.785
## mean.concavity 0.723 0.623
## mean.concave.points 0.857 0.439
## mean.symmetry 0.190 0.603
## mean.fractal.dimension -0.253 0.860
## radius.error 0.705 0.149
## texture.error 0.137
## perimeter.error 0.704 0.199
## area.error 0.757
## smoothness.error -0.197 0.330
## compactness.error 0.253 0.738
## concavity.error 0.234 0.630
## concave.points.error 0.410 0.546
## symmetry.error 0.334
## fractal.dimension.error 0.672
## worst.radius 0.976
## worst.texture 0.312 0.131
## worst.perimeter 0.976
## worst.area 0.952
## worst.smoothness 0.166 0.613
## worst.compactness 0.464 0.739
## worst.concavity 0.573 0.671
## worst.concave.points 0.780 0.497
## worst.symmetry 0.199 0.513
## worst.fractal.dimension 0.836
##
## ML1 ML2
## SS loadings 10.833 7.326
## Proportion Var 0.361 0.244
## Cumulative Var 0.361 0.605
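In this age of data explosion, a great deal of data circulates on web pages, and a tool is needed to collect it all for later analysis. Web scrapers were created for exactly this need. R has many different scraping tools; here we briefly introduce rvest, a package that is part of the tidyverse, and leave the other tools for you to find and explore. Below we use a lottery website as the example, scraping long-term lottery results and drawing them as a bar plot.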
Lottery <- vector(mode = "list", length = 110)
for (i in 1 : 110) {
Lottery[[i]] <- paste0("http://www.lotto-8.com/listlto539.asp?indexpage=", i, "&orderby=new") %>%
read_html %>% html_nodes(css = "tr+ tr .auto-style5:nth-child(2) , tr+ tr .auto-style5:nth-child(1)") %>%
html_text %>% data.frame(date = .[seq(1, length(.), 2)], number = .[seq(2, length(.), 2)]) %>%
select(-.)
}
Lottery <- Lottery %>% rbindlist %>% .[, number := stri_trans_general(number, "latin-ascii")]
head(Lottery, 30)
## date number
## 1: 2018/04/14 09 , 22 , 32 , 34 , 39
## 2: 2018/04/13 02 , 08 , 22 , 27 , 39
## 3: 2018/04/12 03 , 10 , 14 , 20 , 23
## 4: 2018/04/11 08 , 16 , 27 , 29 , 30
## 5: 2018/04/10 07 , 13 , 25 , 28 , 32
## 6: 2018/04/09 12 , 27 , 32 , 34 , 36
## 7: 2018/04/07 05 , 06 , 07 , 10 , 20
## 8: 2018/04/06 01 , 15 , 21 , 35 , 38
## 9: 2018/04/05 12 , 16 , 27 , 31 , 37
## 10: 2018/04/04 07 , 20 , 23 , 26 , 38
## 11: 2018/04/03 06 , 08 , 10 , 21 , 32
## 12: 2018/04/02 08 , 14 , 18 , 23 , 32
## 13: 2018/03/31 04 , 09 , 19 , 27 , 28
## 14: 2018/03/30 09 , 14 , 21 , 32 , 33
## 15: 2018/03/29 13 , 19 , 23 , 28 , 31
## 16: 2018/03/28 15 , 23 , 25 , 30 , 36
## 17: 2018/03/27 01 , 06 , 14 , 35 , 38
## 18: 2018/03/26 02 , 07 , 23 , 32 , 39
## 19: 2018/03/24 02 , 14 , 17 , 23 , 24
## 20: 2018/03/23 06 , 21 , 33 , 34 , 38
## 21: 2018/03/22 04 , 06 , 11 , 31 , 36
## 22: 2018/03/21 05 , 08 , 22 , 26 , 36
## 23: 2018/03/20 07 , 08 , 25 , 29 , 32
## 24: 2018/03/19 07 , 16 , 19 , 36 , 37
## 25: 2018/03/17 08 , 14 , 26 , 36 , 39
## 26: 2018/03/16 02 , 15 , 16 , 28 , 37
## 27: 2018/03/15 02 , 04 , 22 , 30 , 32
## 28: 2018/03/14 02 , 21 , 25 , 30 , 31
## 29: 2018/03/13 04 , 06 , 09 , 33 , 36
## 30: 2018/03/12 01 , 02 , 08 , 33 , 35
## date number
number_count <- strsplit(Lottery$number, ",") %>% unlist %>% trimws %>% table %>% as.data.frame
test_vector <- rep(0, 1e6)
benchmark(
"Loop" = {
for (i in seq_along(test_vector)) {
test_vector[i] <- test_vector[i] + 1
}
},
"Vectorization" = {
test_vector <- test_vector + 1
}
)
## test replications elapsed relative user.self sys.self
## 1 Loop 100 6.490 13.299 6.455 0.015
## 2 Vectorization 100 0.488 1.000 0.243 0.220
## user.child sys.child
## 1 0 0
## 2 0 0
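The benchmark above relies on an external benchmarking function. As a sketch that needs only base R, system.time can time the two approaches, and identical confirms they compute the same thing. The vector length and names are arbitrary choices, and actual timings vary by machine, so none are quoted here.

```r
n <- 1e5                                   # arbitrary vector length

# Loop version: fill a preallocated vector one element at a time.
loop_result <- numeric(n)
loop_time <- system.time(
  for (i in seq_len(n)) loop_result[i] <- i ^ 2
)

# Vectorized version: one operation on the whole vector.
vec_time <- system.time(
  vec_result <- seq_len(n) ^ 2
)

identical(loop_result, vec_result)         # both approaches agree
print(loop_time)
print(vec_time)
```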